{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e1ec4379-3078-482b-88ab-7d10cf0ce388",
   "metadata": {},
   "source": [
    "# Tour of Buckaroo\n",
    "Buckaroo expedites the core task of data work - looking at the data - by showing histograms and summary stats with every DataFrame.\n",
    "\n",
    "This notebook gives a tour of Buckaroo features.\n",
    "## Running buckaroo\n",
    "\n",
    "Buckaroo runs in many python notebook environments including Jupyter Notebook, Jupyter Lab, [Marimo](https://marimo.io/), VS Code, and Google Colab.\n",
    "\n",
    "to get started install buckaroo in your python environment with pip or uv\n",
    "\n",
    "```bash\n",
    "pip install buckaroo\n",
    "uv add buckaroo\n",
    "```\n",
    "\n",
    "then run \n",
    "\n",
    "```python\n",
    "import buckaroo\n",
    "```\n",
    "\n",
    "in the notebook.  Buckaroo will become the default way of displaying dataframes in that environment.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bb078fa-5cab-44a8-b018-a918a79455fd",
   "metadata": {},
   "source": [
    "## Demonstrating Buckaroo on Citibike data.\n",
    "Click `main` below Σ to toggle the summary stats view.\n",
    "\n",
    "You can click on column headers like \"tripduration\" to cycle through sort. You can also use search to filter rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "607fb803-c9aa-4a89-879a-ac628beaf249",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install buckaroo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4b9065a-1411-46ba-a069-025317207a8a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import buckaroo\n",
    "citibike_df = pd.read_parquet(\"https://github.com/paddymul/buckaroo/raw/refs/heads/main/docs/example-notebooks/citibike-trips-2016-04.parq\")\n",
    "citibike_df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1246a669-9f4c-4f48-8648-318d480bf772",
   "metadata": {},
   "source": [
    "## Histograms\n",
    "\n",
    "Histograms are built into Buckaroo. They enable users to quickly identify distributions of data in columns\n",
    "### Common histogram shapes\n",
    "\n",
    "The following shows the most common shapes you will see in histograms, allowing you to quickly identify patterns\n",
    "\n",
    "Notice the three columns on the right. Those are categorical histograms as opposed to numerical histograms\n",
    "## Categorical histograms\n",
    "\n",
    "Categorical histograms have special colors and patterns for NA/NaN, longtail (values that occur at least twice) and unique Categorical histograms are always arranged from most frequent on the left to least frequent on the right.\n",
    "\n",
    "When a column is numerical, but has less than 5 distinct values it is displayed with a categorical histogram, because the numbers were probably flags\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94a9eb55-6898-4b9d-a6ad-70de6db82763",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "def bimodal(mean_1, mean_2, N, sigma=5):\n",
    "    X1 = np.random.normal(mean_1, sigma, int(N / 2))\n",
    "    X2 = np.random.normal(mean_2, sigma, int(N / 2))\n",
    "    X = np.concatenate([X1, X2])\n",
    "    return X\n",
    "\n",
    "\n",
    "def rand_cat(named_p, na_per, N):\n",
    "    choices, p = [], []\n",
    "    named_total_per = sum(named_p.values()) + na_per\n",
    "    total_len = int(np.floor(named_total_per * N))\n",
    "    if named_total_per > 0:\n",
    "        for k, v in named_p.items():\n",
    "            choices.append(k)\n",
    "            p.append(v / named_total_per)\n",
    "        choices.append(pd.NA)\n",
    "        p.append(na_per / named_total_per)\n",
    "        return [np.random.choice(choices, p=p) for k in range(total_len)]\n",
    "    return []\n",
    "\n",
    "\n",
    "def random_categorical(named_p, unique_per, na_per, longtail_per, N):\n",
    "    choice_arr = rand_cat(named_p, na_per, N)\n",
    "    discrete_choice_len = len(choice_arr)\n",
    "\n",
    "    longtail_count = int(np.floor(longtail_per * N)) // 2\n",
    "    extra_arr = []\n",
    "    for i in range(longtail_count):\n",
    "        extra_arr.append(\"long_%d\" % i)\n",
    "        extra_arr.append(\"long_%d\" % i)\n",
    "\n",
    "    unique_len = N - (len(extra_arr) + discrete_choice_len)\n",
    "    for i in range(unique_len):\n",
    "        extra_arr.append(\"unique_%d\" % i)\n",
    "    all_arr = np.concatenate([choice_arr, extra_arr])\n",
    "    np.random.shuffle(all_arr)\n",
    "    try:\n",
    "        return pd.Series(all_arr, dtype=\"UInt64\")\n",
    "    except:\n",
    "        return pd.Series(all_arr, dtype=pd.StringDtype())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f8562580-b04a-4bf8-9935-278afc6c8d1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 4000\n",
    "\n",
    "# random_categorical and bimodal are defined in a hidden code block at the top of this notebook\n",
    "histogram_df = pd.DataFrame(\n",
    "    {\n",
    "        \"normal\": np.random.normal(25, 0.3, N),\n",
    "        \"3_vals\": random_categorical({\"foo\": 0.6, \"bar\": 0.25, \"baz\": 0.15}, unique_per=0, na_per=0, longtail_per=0, N=N),\n",
    "        \"all_unique\": random_categorical({}, unique_per=1, na_per=0, longtail_per=0, N=N),\n",
    "        \"bimodal\": bimodal(20, 40, N),\n",
    "        \"longtail_unique\": random_categorical({1:.3, 2:.1}, unique_per=.1, na_per=.3, longtail_per=.2, N=N),\n",
    "        \"one\": [1] * N,\n",
    "        \"increasing\": [i for i in range(N)],\n",
    "\n",
    "        \"all_NA\": pd.Series([pd.NA] * N, dtype=\"UInt8\"),\n",
    "        \"half_NA\": random_categorical({1: 0.55}, unique_per=0, na_per=0.45, longtail_per=0.0, N=N),\n",
    "\n",
    "        \"longtail\": random_categorical({}, unique_per=0, na_per=0.2, longtail_per=0.8, N=N),\n",
    "    }\n",
    ")\n",
    "histogram_df\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "148d947d-2dfc-4e69-a2fe-3623d2fb0105",
   "metadata": {},
   "source": [
    "## Auto cleaning and the lowcode UI\n",
    "Dealing with dirty data accounts for a large portion of the time in doing data work. We know what good data looks like, and we know the individual pandas commands to clean columns. But we have to type the same commands over and over again.\n",
    "\n",
    "This also shows the Lowcode UI, which is revealed by clicking the checkbox below λ (lambda).  The lowcode UI has a series of commands that can be executed on columns. Commands are added to the operations timeline (similar to CAD timelines).\n",
    "\n",
    "Additonal resources\n",
    "\n",
    "* [Autocleaning notebook](https://marimo.io/p/@paddy-mullen/buckaroo-auto-cleaning)\n",
    "* [Autocleaning in depth](https://www.youtube.com/watch?v=A-GKVsqTLMI) Video explaining how to write your own autocleaning methods and heuristic strategies\n",
    "* [JLisp explanation](https://youtu.be/3Tf3lnuZcj8) The lowcode UI is backed by a small lisp interpreter, this video explains how it works. Don't worry, you will never have to touch lisp to use buckaroo.\n",
    "* [JLisp notebook](https://marimo.io/p/@paddy-mullen/jlisp-in-buckaroo)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44cef6ca-ec7a-49f6-8e03-ab46ea5fbc95",
   "metadata": {},
   "outputs": [],
   "source": [
    "dirty_df = pd.DataFrame(\n",
    "    {\n",
    "        \"a\": [10, 20, 30, 40, 10, 20.3, None, 8, 9, 10, 11, 20, None],\n",
    "        \"b\": [\"3\", \"4\", \"a\", \"5\", \"5\", \"b9\", None, \" 9\", \"9-\", 11, \"867-5309\", \"-9\", None],\n",
    "        \"us_dates\": [ \"\", \"07/10/1982\", \"07/15/1982\", \"7/10/1982\", \"17/10/1982\", \"03/04/1982\",\n",
    "                      \"03/02/2002\", \"12/09/1968\", \"03/04/1982\", \"\", \"06/22/2024\", \"07/4/1776\", \"07/20/1969\"],\n",
    "        \"mostly_bool\": [ True, \"True\", \"Yes\", \"On\", \"false\", False, \"1\", \"Off\", \"0\", \" 0\", \"No\", 1, None],\n",
    "    }\n",
    ")\n",
    "from buckaroo.buckaroo_widget import AutocleaningBuckaroo\n",
    "AutocleaningBuckaroo(dirty_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30cb0d61-5ecb-41a9-8c6e-9021b0ac5c68",
   "metadata": {},
   "source": [
    "## Styling Buckaroo\n",
    "\n",
    "Buckaroo offers many ways to style tables.  Here is an example of applying a heatmap to a column. This colors the `bimodal` column based on the value of the `normal` column.\n",
    "\n",
    "You can see more styles in the [Buckaroo Styling Gallery](https://marimo.io/p/@paddy-mullen/buckaroo-styling-gallery).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "08c03c1b-38bc-4979-8624-41294ada5505",
   "metadata": {},
   "outputs": [],
   "source": [
    "from buckaroo import BuckarooInfiniteWidget\n",
    "#not that here we have to explicitly call Buckaroo to pass options\n",
    "BuckarooInfiniteWidget(histogram_df, column_config_overrides={\"bimodal\": {\"color_map_config\": {\"color_rule\": \"color_map\", \"map_name\": \"DIVERGING_RED_WHITE_BLUE\", \"val_column\": \"normal\"}}})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3203a2e-2bbc-4de9-a204-7ab86cd6363a",
   "metadata": {},
   "source": [
    "## Extending Buckaroo\n",
    "Buckaroo is very extensible. I think of Buckaroo as a framework for building table applications, and an exploratory data analysis tool built with that framework.\n",
    "\n",
    "Let's start with a post processing function. Post processing functions let you modify the displayed dataframe with a simple function.  In this case we will make a \"only_outliers\" function which only shows the 1st and 99th quintile of each numeric row\n",
    "\n",
    "the `.add_processing` decorator adds the post processing function to the BuckarooWidget and enables it\n",
    "to cycle between post processing functions click below `post_processing`  Note how total_rows stays constant and filtered changes.\n",
    "\n",
    "Custom summary stats and styling configurations can also be added. The [Extending Buckaroo](https://www.youtube.com/watch?v=GPl6_9n31NE) video explains how."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25796bbf-43d4-4ded-a690-c41856788c72",
   "metadata": {},
   "outputs": [],
   "source": [
    "bw = BuckarooInfiniteWidget(citibike_df)\n",
    "\n",
    "@bw.add_processing\n",
    "def outliers(df):\n",
    "    mask = pd.Series(False, index=df.index)\n",
    "    for col in df.select_dtypes(include=[\"int64\", \"float64\"]).columns:\n",
    "        ser = df[col]\n",
    "        p1, p99 = ser.quantile(0.01), ser.quantile(0.99)\n",
    "        mask |= (ser <= p1) | (ser >= p99)\n",
    "    return df[mask]\n",
    "bw\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08facbb6-0661-43a6-aa15-e24bcf576d9f",
   "metadata": {},
   "source": [
    "## Try Buckaroo\n",
    "Give buckaroo a try.  It works in Marimo, Jupyter, VSCode, and Google Colab\n",
    "```\n",
    "pip install buckaroo\n",
    "# or \n",
    "uv add buckaroo\n",
    "```\n",
    "\n",
    "Give us a star on [github](https://github.com/paddymul/buckaroo)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ced9592-3451-4623-bef5-fdadebe31be1",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
